Part I - PISA data exploration

by Tatjana Damdinshaw

Introduction

PISA is a survey that was conducted for the fifth time in 2012. It assesses the competencies of 15-year-old students in reading, mathematics and science (with a focus on mathematics) across 65 countries and economies. This dataset contains one row of information per student.

Preliminary Wrangling

In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import re

%matplotlib inline
In [3]:
#Load dataset
df = pd.read_csv('pisa2012.csv', low_memory = False)
In [195]:
df.head()
Out[195]:
Unnamed: 0 CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ... W_FSTR75 W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU VER_STU
0 1 Albania 80000 ALB0006 Non-OECD Albania 1 1 10 1.0 ... 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
1 2 Albania 80000 ALB0006 Non-OECD Albania 1 2 10 1.0 ... 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
2 3 Albania 80000 ALB0006 Non-OECD Albania 1 3 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13
3 4 Albania 80000 ALB0006 Non-OECD Albania 1 4 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13
4 5 Albania 80000 ALB0006 Non-OECD Albania 1 5 9 1.0 ... 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13

5 rows × 636 columns

In [196]:
print(df.info(verbose=True))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485490 entries, 0 to 485489
Data columns (total 636 columns):
 #    Column       Dtype  
---   ------       -----  
 0    Unnamed: 0   int64  
 1    CNT          object 
 2    SUBNATIO     int64  
 3    STRATUM      object 
 4    OECD         object 
 5    NC           object 
 6    SCHOOLID     int64  
 7    STIDSTD      int64  
 8    ST01Q01      int64  
 9    ST02Q01      float64
 10   ST03Q01      int64  
 11   ST03Q02      int64  
 12   ST04Q01      object 
 13   ST05Q01      object 
 14   ST06Q01      float64
 15   ST07Q01      object 
 16   ST07Q02      object 
 17   ST07Q03      object 
 18   ST08Q01      object 
 19   ST09Q01      object 
 20   ST115Q01     float64
 21   ST11Q01      object 
 22   ST11Q02      object 
 23   ST11Q03      object 
 24   ST11Q04      object 
 25   ST11Q05      object 
 26   ST11Q06      object 
 27   ST13Q01      object 
 28   ST14Q01      object 
 29   ST14Q02      object 
 30   ST14Q03      object 
 31   ST14Q04      object 
 32   ST15Q01      object 
 33   ST17Q01      object 
 34   ST18Q01      object 
 35   ST18Q02      object 
 36   ST18Q03      object 
 37   ST18Q04      object 
 38   ST19Q01      object 
 39   ST20Q01      object 
 40   ST20Q02      object 
 41   ST20Q03      object 
 42   ST21Q01      float64
 43   ST25Q01      object 
 44   ST26Q01      object 
 45   ST26Q02      object 
 46   ST26Q03      object 
 47   ST26Q04      object 
 48   ST26Q05      object 
 49   ST26Q06      object 
 50   ST26Q07      object 
 51   ST26Q08      object 
 52   ST26Q09      object 
 53   ST26Q10      object 
 54   ST26Q11      object 
 55   ST26Q12      object 
 56   ST26Q13      object 
 57   ST26Q14      object 
 58   ST26Q15      int64  
 59   ST26Q16      int64  
 60   ST26Q17      int64  
 61   ST27Q01      object 
 62   ST27Q02      object 
 63   ST27Q03      object 
 64   ST27Q04      object 
 65   ST27Q05      object 
 66   ST28Q01      object 
 67   ST29Q01      object 
 68   ST29Q02      object 
 69   ST29Q03      object 
 70   ST29Q04      object 
 71   ST29Q05      object 
 72   ST29Q06      object 
 73   ST29Q07      object 
 74   ST29Q08      object 
 75   ST35Q01      object 
 76   ST35Q02      object 
 77   ST35Q03      object 
 78   ST35Q04      object 
 79   ST35Q05      object 
 80   ST35Q06      object 
 81   ST37Q01      object 
 82   ST37Q02      object 
 83   ST37Q03      object 
 84   ST37Q04      object 
 85   ST37Q05      object 
 86   ST37Q06      object 
 87   ST37Q07      object 
 88   ST37Q08      object 
 89   ST42Q01      object 
 90   ST42Q02      object 
 91   ST42Q03      object 
 92   ST42Q04      object 
 93   ST42Q05      object 
 94   ST42Q06      object 
 95   ST42Q07      object 
 96   ST42Q08      object 
 97   ST42Q09      object 
 98   ST42Q10      object 
 99   ST43Q01      object 
 100  ST43Q02      object 
 101  ST43Q03      object 
 102  ST43Q04      object 
 103  ST43Q05      object 
 104  ST43Q06      object 
 105  ST44Q01      object 
 106  ST44Q03      object 
 107  ST44Q04      object 
 108  ST44Q05      object 
 109  ST44Q07      object 
 110  ST44Q08      object 
 111  ST46Q01      object 
 112  ST46Q02      object 
 113  ST46Q03      object 
 114  ST46Q04      object 
 115  ST46Q05      object 
 116  ST46Q06      object 
 117  ST46Q07      object 
 118  ST46Q08      object 
 119  ST46Q09      object 
 120  ST48Q01      object 
 121  ST48Q02      object 
 122  ST48Q03      object 
 123  ST48Q04      object 
 124  ST48Q05      object 
 125  ST49Q01      object 
 126  ST49Q02      object 
 127  ST49Q03      object 
 128  ST49Q04      object 
 129  ST49Q05      object 
 130  ST49Q06      object 
 131  ST49Q07      object 
 132  ST49Q09      object 
 133  ST53Q01      object 
 134  ST53Q02      object 
 135  ST53Q03      object 
 136  ST53Q04      object 
 137  ST55Q01      object 
 138  ST55Q02      object 
 139  ST55Q03      object 
 140  ST55Q04      object 
 141  ST57Q01      float64
 142  ST57Q02      float64
 143  ST57Q03      float64
 144  ST57Q04      float64
 145  ST57Q05      float64
 146  ST57Q06      float64
 147  ST61Q01      object 
 148  ST61Q02      object 
 149  ST61Q03      object 
 150  ST61Q04      object 
 151  ST61Q05      object 
 152  ST61Q06      object 
 153  ST61Q07      object 
 154  ST61Q08      object 
 155  ST61Q09      object 
 156  ST62Q01      object 
 157  ST62Q02      object 
 158  ST62Q03      object 
 159  ST62Q04      object 
 160  ST62Q06      object 
 161  ST62Q07      object 
 162  ST62Q08      object 
 163  ST62Q09      object 
 164  ST62Q10      object 
 165  ST62Q11      object 
 166  ST62Q12      object 
 167  ST62Q13      object 
 168  ST62Q15      object 
 169  ST62Q16      object 
 170  ST62Q17      object 
 171  ST62Q19      object 
 172  ST69Q01      float64
 173  ST69Q02      float64
 174  ST69Q03      float64
 175  ST70Q01      float64
 176  ST70Q02      float64
 177  ST70Q03      float64
 178  ST71Q01      float64
 179  ST72Q01      float64
 180  ST73Q01      object 
 181  ST73Q02      object 
 182  ST74Q01      object 
 183  ST74Q02      object 
 184  ST75Q01      object 
 185  ST75Q02      object 
 186  ST76Q01      object 
 187  ST76Q02      object 
 188  ST77Q01      object 
 189  ST77Q02      object 
 190  ST77Q04      object 
 191  ST77Q05      object 
 192  ST77Q06      object 
 193  ST79Q01      object 
 194  ST79Q02      object 
 195  ST79Q03      object 
 196  ST79Q04      object 
 197  ST79Q05      object 
 198  ST79Q06      object 
 199  ST79Q07      object 
 200  ST79Q08      object 
 201  ST79Q10      object 
 202  ST79Q11      object 
 203  ST79Q12      object 
 204  ST79Q15      object 
 205  ST79Q17      object 
 206  ST80Q01      object 
 207  ST80Q04      object 
 208  ST80Q05      object 
 209  ST80Q06      object 
 210  ST80Q07      object 
 211  ST80Q08      object 
 212  ST80Q09      object 
 213  ST80Q10      object 
 214  ST80Q11      object 
 215  ST81Q01      object 
 216  ST81Q02      object 
 217  ST81Q03      object 
 218  ST81Q04      object 
 219  ST81Q05      object 
 220  ST82Q01      object 
 221  ST82Q02      object 
 222  ST82Q03      object 
 223  ST83Q01      object 
 224  ST83Q02      object 
 225  ST83Q03      object 
 226  ST83Q04      object 
 227  ST84Q01      object 
 228  ST84Q02      object 
 229  ST84Q03      object 
 230  ST85Q01      object 
 231  ST85Q02      object 
 232  ST85Q03      object 
 233  ST85Q04      object 
 234  ST86Q01      object 
 235  ST86Q02      object 
 236  ST86Q03      object 
 237  ST86Q04      object 
 238  ST86Q05      object 
 239  ST87Q01      object 
 240  ST87Q02      object 
 241  ST87Q03      object 
 242  ST87Q04      object 
 243  ST87Q05      object 
 244  ST87Q06      object 
 245  ST87Q07      object 
 246  ST87Q08      object 
 247  ST87Q09      object 
 248  ST88Q01      object 
 249  ST88Q02      object 
 250  ST88Q03      object 
 251  ST88Q04      object 
 252  ST89Q02      object 
 253  ST89Q03      object 
 254  ST89Q04      object 
 255  ST89Q05      object 
 256  ST91Q01      object 
 257  ST91Q02      object 
 258  ST91Q03      object 
 259  ST91Q04      object 
 260  ST91Q05      object 
 261  ST91Q06      object 
 262  ST93Q01      object 
 263  ST93Q03      object 
 264  ST93Q04      object 
 265  ST93Q06      object 
 266  ST93Q07      object 
 267  ST94Q05      object 
 268  ST94Q06      object 
 269  ST94Q09      object 
 270  ST94Q10      object 
 271  ST94Q14      object 
 272  ST96Q01      object 
 273  ST96Q02      object 
 274  ST96Q03      object 
 275  ST96Q05      object 
 276  ST101Q01     float64
 277  ST101Q02     float64
 278  ST101Q03     float64
 279  ST101Q05     float64
 280  ST104Q01     float64
 281  ST104Q04     float64
 282  ST104Q05     float64
 283  ST104Q06     float64
 284  IC01Q01      object 
 285  IC01Q02      object 
 286  IC01Q03      object 
 287  IC01Q04      object 
 288  IC01Q05      object 
 289  IC01Q06      object 
 290  IC01Q07      object 
 291  IC01Q08      object 
 292  IC01Q09      object 
 293  IC01Q10      object 
 294  IC01Q11      object 
 295  IC02Q01      object 
 296  IC02Q02      object 
 297  IC02Q03      object 
 298  IC02Q04      object 
 299  IC02Q05      object 
 300  IC02Q06      object 
 301  IC02Q07      object 
 302  IC03Q01      object 
 303  IC04Q01      object 
 304  IC05Q01      int64  
 305  IC06Q01      int64  
 306  IC07Q01      int64  
 307  IC08Q01      object 
 308  IC08Q02      object 
 309  IC08Q03      object 
 310  IC08Q04      object 
 311  IC08Q05      object 
 312  IC08Q06      object 
 313  IC08Q07      object 
 314  IC08Q08      object 
 315  IC08Q09      object 
 316  IC08Q11      object 
 317  IC09Q01      object 
 318  IC09Q02      object 
 319  IC09Q03      object 
 320  IC09Q04      object 
 321  IC09Q05      object 
 322  IC09Q06      object 
 323  IC09Q07      object 
 324  IC10Q01      object 
 325  IC10Q02      object 
 326  IC10Q03      object 
 327  IC10Q04      object 
 328  IC10Q05      object 
 329  IC10Q06      object 
 330  IC10Q07      object 
 331  IC10Q08      object 
 332  IC10Q09      object 
 333  IC11Q01      object 
 334  IC11Q02      object 
 335  IC11Q03      object 
 336  IC11Q04      object 
 337  IC11Q05      object 
 338  IC11Q06      object 
 339  IC11Q07      object 
 340  IC22Q01      object 
 341  IC22Q02      object 
 342  IC22Q04      object 
 343  IC22Q06      object 
 344  IC22Q07      object 
 345  IC22Q08      object 
 346  EC01Q01      object 
 347  EC02Q01      object 
 348  EC03Q01      object 
 349  EC03Q02      object 
 350  EC03Q03      object 
 351  EC03Q04      object 
 352  EC03Q05      object 
 353  EC03Q06      object 
 354  EC03Q07      object 
 355  EC03Q08      object 
 356  EC03Q09      object 
 357  EC03Q10      object 
 358  EC04Q01A     float64
 359  EC04Q01B     float64
 360  EC04Q01C     float64
 361  EC04Q02A     float64
 362  EC04Q02B     float64
 363  EC04Q02C     float64
 364  EC04Q03A     float64
 365  EC04Q03B     float64
 366  EC04Q03C     float64
 367  EC04Q04A     float64
 368  EC04Q04B     float64
 369  EC04Q04C     float64
 370  EC04Q05A     float64
 371  EC04Q05B     float64
 372  EC04Q05C     float64
 373  EC04Q06A     float64
 374  EC04Q06B     float64
 375  EC04Q06C     float64
 376  EC05Q01      object 
 377  EC06Q01      object 
 378  EC07Q01      object 
 379  EC07Q02      object 
 380  EC07Q03      object 
 381  EC07Q04      object 
 382  EC07Q05      object 
 383  EC08Q01      object 
 384  EC08Q02      object 
 385  EC08Q03      object 
 386  EC08Q04      object 
 387  EC09Q03      object 
 388  EC10Q01      object 
 389  EC11Q02      object 
 390  EC11Q03      object 
 391  EC12Q01      object 
 392  ST22Q01      object 
 393  ST23Q01      object 
 394  ST23Q02      object 
 395  ST23Q03      object 
 396  ST23Q04      object 
 397  ST23Q05      object 
 398  ST23Q06      object 
 399  ST23Q07      object 
 400  ST23Q08      object 
 401  ST24Q01      object 
 402  ST24Q02      object 
 403  ST24Q03      object 
 404  CLCUSE1      object 
 405  CLCUSE301    int64  
 406  CLCUSE302    int64  
 407  DEFFORT      int64  
 408  QUESTID      object 
 409  BOOKID       object 
 410  EASY         object 
 411  AGE          float64
 412  GRADE        float64
 413  PROGN        object 
 414  ANXMAT       float64
 415  ATSCHL       float64
 416  ATTLNACT     float64
 417  BELONG       float64
 418  BFMJ2        float64
 419  BMMJ1        float64
 420  CLSMAN       float64
 421  COBN_F       object 
 422  COBN_M       object 
 423  COBN_S       object 
 424  COGACT       float64
 425  CULTDIST     float64
 426  CULTPOS      float64
 427  DISCLIMA     float64
 428  ENTUSE       float64
 429  ESCS         float64
 430  EXAPPLM      float64
 431  EXPUREM      float64
 432  FAILMAT      float64
 433  FAMCON       float64
 434  FAMCONC      float64
 435  FAMSTRUC     float64
 436  FISCED       object 
 437  HEDRES       float64
 438  HERITCUL     float64
 439  HISCED       object 
 440  HISEI        float64
 441  HOMEPOS      float64
 442  HOMSCH       float64
 443  HOSTCUL      float64
 444  ICTATTNEG    float64
 445  ICTATTPOS    float64
 446  ICTHOME      float64
 447  ICTRES       float64
 448  ICTSCH       float64
 449  IMMIG        object 
 450  INFOCAR      float64
 451  INFOJOB1     float64
 452  INFOJOB2     float64
 453  INSTMOT      float64
 454  INTMAT       float64
 455  ISCEDD       object 
 456  ISCEDL       object 
 457  ISCEDO       object 
 458  LANGCOMM     float64
 459  LANGN        object 
 460  LANGRPPD     float64
 461  LMINS        float64
 462  MATBEH       float64
 463  MATHEFF      float64
 464  MATINTFC     float64
 465  MATWKETH     float64
 466  MISCED       object 
 467  MMINS        float64
 468  MTSUP        float64
 469  OCOD1        object 
 470  OCOD2        object 
 471  OPENPS       float64
 472  OUTHOURS     float64
 473  PARED        float64
 474  PERSEV       float64
 475  REPEAT       object 
 476  SCMAT        float64
 477  SMINS        float64
 478  STUDREL      float64
 479  SUBNORM      float64
 480  TCHBEHFA     float64
 481  TCHBEHSO     float64
 482  TCHBEHTD     float64
 483  TEACHSUP     float64
 484  TESTLANG     object 
 485  TIMEINT      float64
 486  USEMATH      float64
 487  USESCH       float64
 488  WEALTH       float64
 489  ANCATSCHL    float64
 490  ANCATTLNACT  float64
 491  ANCBELONG    float64
 492  ANCCLSMAN    float64
 493  ANCCOGACT    float64
 494  ANCINSTMOT   float64
 495  ANCINTMAT    float64
 496  ANCMATWKETH  float64
 497  ANCMTSUP     float64
 498  ANCSCMAT     float64
 499  ANCSTUDREL   float64
 500  ANCSUBNORM   float64
 501  PV1MATH      float64
 502  PV2MATH      float64
 503  PV3MATH      float64
 504  PV4MATH      float64
 505  PV5MATH      float64
 506  PV1MACC      float64
 507  PV2MACC      float64
 508  PV3MACC      float64
 509  PV4MACC      float64
 510  PV5MACC      float64
 511  PV1MACQ      float64
 512  PV2MACQ      float64
 513  PV3MACQ      float64
 514  PV4MACQ      float64
 515  PV5MACQ      float64
 516  PV1MACS      float64
 517  PV2MACS      float64
 518  PV3MACS      float64
 519  PV4MACS      float64
 520  PV5MACS      float64
 521  PV1MACU      float64
 522  PV2MACU      float64
 523  PV3MACU      float64
 524  PV4MACU      float64
 525  PV5MACU      float64
 526  PV1MAPE      float64
 527  PV2MAPE      float64
 528  PV3MAPE      float64
 529  PV4MAPE      float64
 530  PV5MAPE      float64
 531  PV1MAPF      float64
 532  PV2MAPF      float64
 533  PV3MAPF      float64
 534  PV4MAPF      float64
 535  PV5MAPF      float64
 536  PV1MAPI      float64
 537  PV2MAPI      float64
 538  PV3MAPI      float64
 539  PV4MAPI      float64
 540  PV5MAPI      float64
 541  PV1READ      float64
 542  PV2READ      float64
 543  PV3READ      float64
 544  PV4READ      float64
 545  PV5READ      float64
 546  PV1SCIE      float64
 547  PV2SCIE      float64
 548  PV3SCIE      float64
 549  PV4SCIE      float64
 550  PV5SCIE      float64
 551  W_FSTUWT     float64
 552  W_FSTR1      float64
 553  W_FSTR2      float64
 554  W_FSTR3      float64
 555  W_FSTR4      float64
 556  W_FSTR5      float64
 557  W_FSTR6      float64
 558  W_FSTR7      float64
 559  W_FSTR8      float64
 560  W_FSTR9      float64
 561  W_FSTR10     float64
 562  W_FSTR11     float64
 563  W_FSTR12     float64
 564  W_FSTR13     float64
 565  W_FSTR14     float64
 566  W_FSTR15     float64
 567  W_FSTR16     float64
 568  W_FSTR17     float64
 569  W_FSTR18     float64
 570  W_FSTR19     float64
 571  W_FSTR20     float64
 572  W_FSTR21     float64
 573  W_FSTR22     float64
 574  W_FSTR23     float64
 575  W_FSTR24     float64
 576  W_FSTR25     float64
 577  W_FSTR26     float64
 578  W_FSTR27     float64
 579  W_FSTR28     float64
 580  W_FSTR29     float64
 581  W_FSTR30     float64
 582  W_FSTR31     float64
 583  W_FSTR32     float64
 584  W_FSTR33     float64
 585  W_FSTR34     float64
 586  W_FSTR35     float64
 587  W_FSTR36     float64
 588  W_FSTR37     float64
 589  W_FSTR38     float64
 590  W_FSTR39     float64
 591  W_FSTR40     float64
 592  W_FSTR41     float64
 593  W_FSTR42     float64
 594  W_FSTR43     float64
 595  W_FSTR44     float64
 596  W_FSTR45     float64
 597  W_FSTR46     float64
 598  W_FSTR47     float64
 599  W_FSTR48     float64
 600  W_FSTR49     float64
 601  W_FSTR50     float64
 602  W_FSTR51     float64
 603  W_FSTR52     float64
 604  W_FSTR53     float64
 605  W_FSTR54     float64
 606  W_FSTR55     float64
 607  W_FSTR56     float64
 608  W_FSTR57     float64
 609  W_FSTR58     float64
 610  W_FSTR59     float64
 611  W_FSTR60     float64
 612  W_FSTR61     float64
 613  W_FSTR62     float64
 614  W_FSTR63     float64
 615  W_FSTR64     float64
 616  W_FSTR65     float64
 617  W_FSTR66     float64
 618  W_FSTR67     float64
 619  W_FSTR68     float64
 620  W_FSTR69     float64
 621  W_FSTR70     float64
 622  W_FSTR71     float64
 623  W_FSTR72     float64
 624  W_FSTR73     float64
 625  W_FSTR74     float64
 626  W_FSTR75     float64
 627  W_FSTR76     float64
 628  W_FSTR77     float64
 629  W_FSTR78     float64
 630  W_FSTR79     float64
 631  W_FSTR80     float64
 632  WVARSTRR     int64  
 633  VAR_UNIT     int64  
 634  SENWGT_STU   float64
 635  VER_STU      object 
dtypes: float64(250), int64(18), object(368)
memory usage: 2.3+ GB
None
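With 636 columns and more than 2 GB in memory, it may be worth loading only the columns the analysis needs. The sketch below shows one possible way to do that with the `usecols` parameter of `pd.read_csv`. It runs on a tiny in-memory CSV with made-up values rather than on `pisa2012.csv`, and the selection function `wanted` is a hypothetical helper, not the selection used in this notebook.

```python
import io
import pandas as pd

# Tiny stand-in for pisa2012.csv (made-up values); the first, unnamed
# column mimics the saved row index that shows up as "Unnamed: 0".
csv_text = """,CNT,ST04Q01,PV1MATH,PV1READ,W_FSTUWT
1,Albania,Female,406.8,249.6,8.9
2,Albania,Male,486.1,406.3,8.9
"""

# Hypothetical selection: two identifier columns plus all plausible values.
def wanted(name):
    return name in {"CNT", "ST04Q01"} or name.startswith("PV")

small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(small.columns.tolist())
```

Passing a function to `usecols` avoids listing hundreds of column names by hand; the unnamed index column and the weight column are skipped because the function returns `False` for them.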
In [197]:
print(df.shape)
print(df.dtypes.head(30))
(485490, 636)
Unnamed: 0      int64
CNT            object
SUBNATIO        int64
STRATUM        object
OECD           object
NC             object
SCHOOLID        int64
STIDSTD         int64
ST01Q01         int64
ST02Q01       float64
ST03Q01         int64
ST03Q02         int64
ST04Q01        object
ST05Q01        object
ST06Q01       float64
ST07Q01        object
ST07Q02        object
ST07Q03        object
ST08Q01        object
ST09Q01        object
ST115Q01      float64
ST11Q01        object
ST11Q02        object
ST11Q03        object
ST11Q04        object
ST11Q05        object
ST11Q06        object
ST13Q01        object
ST14Q01        object
ST14Q02        object
dtype: object
In [198]:
df.describe()
Out[198]:
Unnamed: 0 SUBNATIO SCHOOLID STIDSTD ST01Q01 ST02Q01 ST03Q01 ST03Q02 ST06Q01 ST115Q01 ... W_FSTR74 W_FSTR75 W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU
count 485490.000000 4.854900e+05 485490.000000 485490.000000 485490.000000 485438.000000 485490.000000 485490.000000 457994.000000 479269.000000 ... 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000 485490.000000
mean 242745.500000 4.315457e+06 240.152197 6134.066201 9.813323 2.579260 6.558512 1996.070061 6.148963 1.265356 ... 50.844201 51.020378 50.943149 50.685275 51.019842 50.540724 50.721164 40.013920 1.531189 0.140054
std 140149.035431 2.524434e+06 278.563016 6733.144944 3.734726 2.694013 3.705244 0.255250 0.970693 0.578992 ... 120.684726 122.946533 121.170883 119.267686 122.981541 119.479516 119.799018 22.951264 0.539759 0.137864
min 1.000000 8.000000e+04 1.000000 1.000000 7.000000 1.000000 1.000000 1996.000000 4.000000 1.000000 ... 0.292900 0.292900 0.292900 0.292900 0.292900 0.292900 0.292900 1.000000 1.000000 0.000500
25% 121373.250000 2.030000e+06 61.000000 1811.000000 9.000000 1.000000 4.000000 1996.000000 6.000000 1.000000 ... 4.660300 4.664800 4.643100 4.667000 4.675200 4.651850 4.660300 20.000000 1.000000 0.037800
50% 242745.500000 4.100000e+06 136.000000 3740.000000 10.000000 1.000000 7.000000 1996.000000 6.000000 1.000000 ... 13.637700 13.698900 13.611700 13.672100 13.731100 13.582000 13.600200 40.000000 2.000000 0.145200
75% 364117.750000 6.880000e+06 291.000000 7456.000000 10.000000 3.000000 9.000000 1996.000000 7.000000 1.000000 ... 41.233500 41.512500 41.695200 41.097300 41.189600 41.290925 41.356000 60.000000 2.000000 0.199900
max 485490.000000 8.580000e+06 1471.000000 33806.000000 96.000000 25.000000 99.000000 1997.000000 16.000000 4.000000 ... 2476.566800 4155.283000 3743.450100 3232.163700 3904.868100 3607.478300 3412.174100 80.000000 3.000000 5.095500

8 rows × 268 columns

In [5]:
# Filter the columns for the ST42 questionnaire items
pattern = re.compile(r'st42', re.IGNORECASE)
columns = df.filter(regex=pattern)
columns
Out[5]:
ST42Q01 ST42Q02 ST42Q03 ST42Q04 ST42Q05 ST42Q06 ST42Q07 ST42Q08 ST42Q09 ST42Q10
0 Agree Disagree Agree Agree Agree Agree Agree Disagree Disagree Disagree
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN Strongly agree Disagree Agree Agree Disagree Strongly agree Disagree Agree Agree
4 Strongly agree Strongly agree Agree Strongly agree Strongly agree Disagree Disagree Disagree Agree Agree
... ... ... ... ... ... ... ... ... ... ...
485485 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
485486 Agree Disagree Disagree Agree Disagree Agree Agree Disagree Agree Disagree
485487 Agree Disagree Disagree Agree Disagree Agree Disagree Agree Disagree Agree
485488 Disagree Disagree Strongly disagree Disagree Disagree Agree Agree Disagree Disagree Strongly agree
485489 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

485490 rows × 10 columns
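`DataFrame.filter` applies the pattern with a search, so `st42` matches anywhere in a column name. In this dataset that gives the ten expected columns, but a stricter variant could anchor the pattern at the start of the name. The following is a minimal sketch on a toy frame; the column `XST42` is a hypothetical name added only to show the difference.

```python
import pandas as pd

# Toy frame with made-up values; only the column names matter here.
toy = pd.DataFrame({
    "ST42Q01": ["Agree"],
    "ST42Q02": ["Disagree"],
    "ST46Q01": ["Agree"],
    "XST42": ["n/a"],   # hypothetical name that merely contains "ST42"
})

# (?i) makes the match case-insensitive; ^ restricts it to names
# that start with "st42".
st42 = toy.filter(regex=r"(?i)^st42")
print(st42.columns.tolist())
```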

In [200]:
df['ST04Q01'].value_counts()
Out[200]:
Female    245064
Male      240426
Name: ST04Q01, dtype: int64
In [201]:
# Count the answers given for one column
df['ST46Q01'].value_counts()
Out[201]:
Agree                148211
Strongly agree        77022
Disagree              70473
Strongly disagree     18192
Name: ST46Q01, dtype: int64
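The four counts above sum to 313,898, well below the 485,490 rows, because `value_counts` drops missing answers by default. The sketch below, on a few made-up responses, shows how the missing answers could be included and how proportions could be obtained instead of counts.

```python
import numpy as np
import pandas as pd

# Made-up responses for illustration.
answers = pd.Series(["Agree", "Agree", "Disagree", np.nan], name="ST46Q01")

counts = answers.value_counts(dropna=False)    # keep NaN as its own row
shares = answers.value_counts(normalize=True)  # proportions among non-missing answers
print(counts)
print(shares)
```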
In [4]:
# Declare selected ST42 items as ordered categoricals
agreement_levels = ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree']
ordinal_var_dict = {var: agreement_levels
                    for var in ['ST42Q01', 'ST42Q03', 'ST42Q05', 'ST42Q08', 'ST42Q10']}

for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    df[var] = df[var].astype(ordered_var)
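Declaring the answers as an ordered categorical type changes how they sort and compare. The sketch below illustrates this on three made-up responses, using the same level order as the cell above.

```python
import pandas as pd

levels = ["Strongly agree", "Agree", "Disagree", "Strongly disagree"]
agreement = pd.api.types.CategoricalDtype(categories=levels, ordered=True)

# Made-up responses for illustration.
s = pd.Series(["Disagree", "Strongly agree", "Agree"]).astype(agreement)

# Sorting follows the declared level order, not alphabetical order.
print(s.sort_values().tolist())
# Comparisons are defined too: True where the answer is "Agree" or "Strongly agree".
print((s <= "Agree").tolist())
```

Note that any value not listed in `categories` becomes `NaN` during the conversion, so the level names have to match the raw answers exactly.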
In [203]:
pattern = re.compile(r'W_', re.IGNORECASE)
columns = df.filter(regex=pattern)
columns
Out[203]:
W_FSTUWT W_FSTR1 W_FSTR2 W_FSTR3 W_FSTR4 W_FSTR5 W_FSTR6 W_FSTR7 W_FSTR8 W_FSTR9 ... W_FSTR71 W_FSTR72 W_FSTR73 W_FSTR74 W_FSTR75 W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80
0 8.9096 13.1249 13.0829 4.5315 13.0829 13.9235 13.1249 13.1249 4.3389 4.3313 ... 13.0829 13.9235 4.3389 4.3313 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829
1 8.9096 13.1249 13.0829 4.5315 13.0829 13.9235 13.1249 13.1249 4.3389 4.3313 ... 13.0829 13.9235 4.3389 4.3313 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829
2 8.4871 12.7307 12.7307 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 4.2436 ... 12.7307 12.7307 4.2436 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307
3 8.4871 12.7307 12.7307 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 4.2436 ... 12.7307 12.7307 4.2436 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307
4 8.4871 12.7307 12.7307 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 4.2436 ... 12.7307 12.7307 4.2436 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
485485 62.4825 93.7238 31.2413 93.7238 31.2413 31.2413 93.7238 93.7238 31.2413 31.2413 ... 31.2413 93.7238 31.2413 93.7238 31.2413 93.7238 93.7238 93.7238 93.7238 31.2413
485486 65.7647 96.0036 33.9163 96.0036 33.9163 33.9163 96.0036 96.0036 33.9163 33.9163 ... 33.9163 96.0036 33.9163 96.0036 33.9163 96.0036 96.0036 96.0036 96.0036 33.9163
485487 65.7647 96.0036 33.9163 96.0036 33.9163 33.9163 96.0036 96.0036 33.9163 33.9163 ... 33.9163 96.0036 33.9163 96.0036 33.9163 96.0036 96.0036 96.0036 96.0036 33.9163
485488 65.7647 96.0036 33.9163 96.0036 33.9163 33.9163 96.0036 96.0036 33.9163 33.9163 ... 33.9163 96.0036 33.9163 96.0036 33.9163 96.0036 96.0036 96.0036 96.0036 33.9163
485489 62.4825 93.7238 31.2413 93.7238 31.2413 31.2413 93.7238 93.7238 31.2413 31.2413 ... 31.2413 93.7238 31.2413 93.7238 31.2413 93.7238 93.7238 93.7238 93.7238 31.2413

485490 rows × 81 columns

In [204]:
df['ANXMAT'].isna().sum()
Out[204]:
170726
In [205]:
df['ANXMAT']
Out[205]:
0         0.32
1          NaN
2          NaN
3         0.31
4         1.02
          ... 
485485     NaN
485486   -0.20
485487    0.32
485488   -0.20
485489     NaN
Name: ANXMAT, Length: 485490, dtype: float64
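The count of 170,726 missing values corresponds to roughly 35% of the 485,490 rows. A share like this can be read off directly, since the mean of a boolean mask is the fraction of `True` entries. A minimal sketch, using only the first five `ANXMAT` values printed above:

```python
import numpy as np
import pandas as pd

# First five ANXMAT values from the output above.
anxmat = pd.Series([0.32, np.nan, np.nan, 0.31, 1.02], name="ANXMAT")

n_missing = anxmat.isna().sum()
share_missing = anxmat.isna().mean()  # mean of a boolean mask = fraction of True
print(n_missing, share_missing)
```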

To visualize the data and answer the questions that follow, a mathematics score and a total score are needed for each student. The mathematics score is the mean of the student's plausible values for the math section; the total score is the mean of all of the student's plausible values.

In [6]:
#Determine math score
math_pattern = re.compile(r'PV\dMA', re.IGNORECASE)
math_columns = df.filter(regex=math_pattern)
math_mean = math_columns.mean(axis = 1)
print(math_columns)
print(math_mean)
         PV1MATH   PV2MATH   PV3MATH   PV4MATH   PV5MATH   PV1MACC   PV2MACC  \
0       406.8469  376.4683  344.5319  321.1637  381.9209  325.8374  324.2795   
1       486.1427  464.3325  453.4273  472.9008  476.0165  325.6816  419.9330   
2       533.2684  481.0796  489.6479  490.4269  533.2684  611.1622  486.5322   
3       412.2215  498.6836  415.3373  466.7472  454.2842  538.4094  511.9255   
4       381.9209  328.1742  403.7311  418.5309  395.1628  373.3525  293.1220   
...          ...       ...       ...       ...       ...       ...       ...   
485485  477.1849  493.5426  479.5217  486.5322  494.3215  507.5635  480.3007   
485486  518.9360  515.8202  505.6940  596.8297  508.8098  592.1561  491.6732   
485487  475.2376  482.2480  507.9530  457.3220  508.7319  557.8050  453.4273   
485488  550.9503  517.4560  529.1401  515.8981  501.0983  574.3184  554.0661   
485489  470.0187  441.1980  475.4713  441.9769  443.5348  467.6819  444.3138   

         PV3MACC   PV4MACC   PV5MACC  ...   PV1MAPF   PV2MAPF   PV3MAPF  \
0       279.8800  267.4170  312.5954  ...  319.6059  345.3108  360.8895   
1       378.6493  359.9548  384.1019  ...  411.3647  437.8486  457.3220   
2       567.5417  541.0578  544.9525  ...  580.7836  481.0796  555.0787   
3       553.9882  483.8838  479.2102  ...  534.5147  455.8420  504.1362   
4       364.0053  430.2150  403.7311  ...  432.5518  431.7729  399.0575   
...          ...       ...       ...  ...       ...       ...       ...   
485485  556.6365  502.8899  452.2589  ...  576.1100  488.8690  456.1536   
485486  542.3041  556.3250  576.5773  ...  546.9777  581.2510  560.2197   
485487  546.8998  514.1845  514.1845  ...  552.3524  514.1845  479.9112   
485488  593.0129  495.6457  557.1818  ...  551.7292  469.9408  608.5917   
485489  462.2293  436.5244  421.7246  ...  462.2293  460.6714  454.4399   

         PV4MAPF   PV5MAPF   PV1MAPI   PV2MAPI   PV3MAPI   PV4MAPI   PV5MAPI  
0       390.4892  322.7216  290.7852  345.3108  326.6163  407.6258  367.1210  
1       454.2063  460.4378  434.7328  448.7537  494.7110  429.2803  434.7328  
2       453.8168  491.2058  527.0369  444.4695  516.1318  403.9648  476.4060  
3       454.2842  483.8838  521.2728  481.5470  503.3572  469.8629  478.4312  
4       369.4579  341.4161  297.0167  353.8791  347.6476  314.1533  311.0375  
...          ...       ...       ...       ...       ...       ...       ...  
485485  530.9316  416.4278  527.0369  463.1640  423.4382  515.3529  397.7333  
485486  567.2301  574.2405  470.6418  472.9787  476.0944  443.3790  470.6418  
485487  421.4909  493.1531  489.2585  472.1218  458.1010  440.1854  488.4795  
485488  541.6031  554.0661  462.9304  428.6571  483.1827  443.4569  521.3507  
485489  438.0823  408.4826  403.8090  408.4826  431.8508  394.4618  374.2094  

[485490 rows x 40 columns]
0         355.183832
1         432.240230
2         512.509733
3         510.640287
4         378.980370
             ...    
485485    482.754318
485486    527.075863
485487    489.063728
485488    526.355352
485489    424.509273
Length: 485490, dtype: float64
In [8]:
#Add mean math score to data frame
df['math_score'] = math_mean
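The pattern `PV\dMA` matches 40 columns: the five overall math values (`PV1MATH` to `PV5MATH`) and the math subscale columns such as `PV1MACC`. If only the overall math scale were wanted, an anchored pattern would be one option. The sketch below contrasts the two on a toy frame with made-up values; it is an alternative for comparison, not the approach taken in this notebook.

```python
import pandas as pd

# Made-up values for two students.
toy = pd.DataFrame({
    "PV1MATH": [400.0, 500.0],
    "PV2MATH": [420.0, 520.0],
    "PV1MACC": [300.0, 600.0],   # a math subscale column
    "PV1READ": [450.0, 480.0],
})

broad = toy.filter(regex=r"(?i)PV\dMA")       # overall scale plus subscales
strict = toy.filter(regex=r"(?i)^PV\dMATH$")  # overall math scale only

print(broad.columns.tolist())
print(strict.mean(axis=1).tolist())
```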
In [208]:
df.tail()
Out[208]:
Unnamed: 0 CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ... W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU VER_STU math_score
485485 485486 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4955 10 3.0 ... 93.7238 93.7238 93.7238 93.7238 31.2413 41 1 0.0653 22NOV13 482.754318
485486 485487 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4956 10 3.0 ... 96.0036 96.0036 96.0036 96.0036 33.9163 41 1 0.0688 22NOV13 527.075863
485487 485488 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4957 10 3.0 ... 96.0036 96.0036 96.0036 96.0036 33.9163 41 1 0.0688 22NOV13 489.063728
485488 485489 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4958 10 3.0 ... 96.0036 96.0036 96.0036 96.0036 33.9163 41 1 0.0688 22NOV13 526.355352
485489 485490 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4959 10 3.0 ... 93.7238 93.7238 93.7238 93.7238 31.2413 41 1 0.0653 22NOV13 424.509273

5 rows × 637 columns

In [9]:
#Determine total score
total_pattern = re.compile(r'PV', re.IGNORECASE)
total_columns = df.filter(regex=total_pattern)
total_mean = total_columns.mean(axis = 1)
print(total_columns)
print(total_mean)
         PV1MATH   PV2MATH   PV3MATH   PV4MATH   PV5MATH   PV1MACC   PV2MACC  \
0       406.8469  376.4683  344.5319  321.1637  381.9209  325.8374  324.2795   
1       486.1427  464.3325  453.4273  472.9008  476.0165  325.6816  419.9330   
2       533.2684  481.0796  489.6479  490.4269  533.2684  611.1622  486.5322   
3       412.2215  498.6836  415.3373  466.7472  454.2842  538.4094  511.9255   
4       381.9209  328.1742  403.7311  418.5309  395.1628  373.3525  293.1220   
...          ...       ...       ...       ...       ...       ...       ...   
485485  477.1849  493.5426  479.5217  486.5322  494.3215  507.5635  480.3007   
485486  518.9360  515.8202  505.6940  596.8297  508.8098  592.1561  491.6732   
485487  475.2376  482.2480  507.9530  457.3220  508.7319  557.8050  453.4273   
485488  550.9503  517.4560  529.1401  515.8981  501.0983  574.3184  554.0661   
485489  470.0187  441.1980  475.4713  441.9769  443.5348  467.6819  444.3138   

         PV3MACC   PV4MACC   PV5MACC  ...   PV1READ   PV2READ   PV3READ  \
0       279.8800  267.4170  312.5954  ...  249.5762  254.3420  406.8496   
1       378.6493  359.9548  384.1019  ...  406.2936  349.8975  400.7334   
2       567.5417  541.0578  544.9525  ...  401.2100  404.3872  387.7067   
3       553.9882  483.8838  479.2102  ...  547.3630  481.4353  461.5776   
4       364.0053  430.2150  403.7311  ...  311.7707  141.7883  293.5015   
...          ...       ...       ...  ...       ...       ...       ...   
485485  556.6365  502.8899  452.2589  ...  460.2272  476.1134  472.9362   
485486  542.3041  556.3250  576.5773  ...  490.9325  479.7053  448.4294   
485487  546.8998  514.1845  514.1845  ...  462.6239  514.7503  434.5558   
485488  593.0129  495.6457  557.1818  ...  505.2873  522.1282  513.3068   
485489  462.2293  436.5244  421.7246  ...  532.3506  483.1034  479.9261   

         PV4READ   PV5READ   PV1SCIE   PV2SCIE   PV3SCIE   PV4SCIE   PV5SCIE  
0       175.7053  218.5981  341.7009  408.8400  348.2283  367.8105  392.9877  
1       369.7553  396.7618  548.9929  471.5964  471.5964  443.6218  454.8116  
2       431.3938  401.2100  499.6643  428.7952  492.2044  512.7191  499.6643  
3       425.0393  471.9036  438.6796  481.5740  448.9370  474.1141  426.5573  
4       272.8495  260.1405  361.5628  275.7740  372.7527  403.5248  422.1746  
...          ...       ...       ...       ...       ...       ...       ...  
485485  472.1419  481.6736  559.8098  528.1052  519.7128  535.5651  538.3626  
485486  565.5134  451.6372  538.7355  493.9761  493.0436  561.1153  535.0056  
485487  457.8122  511.5425  536.8706  571.3726  488.3812  548.9929  563.9127  
485488  528.5437  522.9301  511.0407  532.4879  524.0955  551.1376  514.7706  
485489  459.2741  488.6635  530.6229  473.7411  477.4711  477.4711  505.4457  

[485490 rows x 50 columns]
0         347.439838
1         432.073398
2         499.186886
3         501.655846
4         365.501084
             ...    
485485    487.096410
485486    522.822568
485487    493.067276
485488    525.598850
485489    437.768810
Length: 485490, dtype: float64
In [10]:
#Add total score to data frame
df['total_score'] = total_mean
In [211]:
df.tail()
Out[211]:
Unnamed: 0 CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ... W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU VER_STU math_score total_score
485485 485486 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4955 10 3.0 ... 93.7238 93.7238 93.7238 31.2413 41 1 0.0653 22NOV13 482.754318 487.096410
485486 485487 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4956 10 3.0 ... 96.0036 96.0036 96.0036 33.9163 41 1 0.0688 22NOV13 527.075863 522.822568
485487 485488 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4957 10 3.0 ... 96.0036 96.0036 96.0036 33.9163 41 1 0.0688 22NOV13 489.063728 493.067276
485488 485489 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4958 10 3.0 ... 96.0036 96.0036 96.0036 33.9163 41 1 0.0688 22NOV13 526.355352 525.598850
485489 485490 Vietnam 7040000 VNM0317 Non-OECD Viet Nam 162 4959 10 3.0 ... 93.7238 93.7238 93.7238 31.2413 41 1 0.0653 22NOV13 424.509273 437.768810

5 rows × 638 columns

What is the structure of your dataset?¶

The dataset contains 485490 student PISA results with 636 features. Most of the features are the students' answers to survey questions, stored as ordered factor variables (Strongly agree, Agree, Disagree, Strongly disagree) or on a numeric scale. The numeric variables include the resulting scores in the different categories.

What is/are the main feature(s) of interest in your dataset?¶

The influence of math anxiety and out-of-school learning hours on the PISA math score, in general and per gender.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

I expect that students with high math anxiety score lower in math. I also expect that students who spend many hours learning outside of school score higher in total.

Univariate Exploration¶

I start by inspecting the distributions of the main variables of interest: the math score and the total score.

In [11]:
#Math score
plt.figure(figsize=[14, 6])
bins = np.arange(0, df['math_score'].max()+100, 100)
plt.hist(data = df, x = 'math_score', bins=bins)
plt.xlabel('Math score')
plt.title('Math score distribution')
plt.show()

The math score is almost normally distributed.

In [12]:
#Total score
plt.figure(figsize=[14, 6])
bins = np.arange(0, df['total_score'].max()+100, 100)
plt.hist(data = df, x = 'total_score', bins=bins)  # pass the bin edges computed above
plt.xlabel('Total score')
plt.title('Total score distribution')
plt.show()

The total score is also almost normally distributed. Next I'll look at the distributions of the other variables relevant to my analysis, before examining their relationship with the total score and the math score in the next step:

  • Anxiety towards mathematics
  • Learning hours outside of school
  • Math anxiety questions
In [13]:
#Visualize math anxiety score
plt.figure(figsize=[14, 6])
bins = np.arange(df['ANXMAT'].min()-0.5, df['ANXMAT'].max()+0.5, 0.5)
plt.hist(data = df, x = 'ANXMAT', bins=bins, edgecolor = 'black')
plt.xlabel('Anxiety')
plt.title('Math anxiety score distribution')
plt.show()
In [14]:
#Visualize with smaller bin size
plt.figure(figsize=[14,6])
bins = np.arange(df['ANXMAT'].min()-0.1, df['ANXMAT'].max()+0.1, 0.1)
plt.hist(data = df, x = 'ANXMAT', bins=bins)
plt.xlabel('Anxiety')
plt.title('Math anxiety score distribution')
plt.show()

The anxiety score is also normally distributed.

In [15]:
#Visualize learning hours outside of school on standard scale
plt.figure(figsize=[14, 6])
bins = np.arange(0, df['OUTHOURS'].max()+1, 1)
plt.hist(data = df, x = 'OUTHOURS', bins=bins)
plt.xlim([0,80])
plt.xlabel('Learning hours outside of school')
plt.title('Out-of-school learning hours distribution')
plt.show()
In [167]:
#Determine the possible limit
df.query('OUTHOURS>=80')['OUTHOURS'].sum()
Out[167]:
42810.0
In [168]:
df.query('OUTHOURS<80')['OUTHOURS'].sum()
Out[168]:
3386125.0
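
The two cells above sum the hours on each side of the 80-hour cut-off. To judge how many students an x-axis limit of 80 would hide, counting students may be more direct. The following is a minimal sketch on a made-up Series (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for df['OUTHOURS']; None marks a missing answer.
hours = pd.Series([0, 2, 5, 10, 90, None, 120, 7], name='OUTHOURS', dtype='float64')

# Ignore missing answers, then count and average the boolean mask.
valid = hours.dropna()
n_high = int((valid >= 80).sum())      # number of students at or above 80 hours
share_high = (valid >= 80).mean()      # fraction of students at or above 80 hours

print(n_high, round(share_high, 3))
```

The mean of a boolean mask is the share of `True` values, so `share_high` can be read directly as the fraction of students cut off by the limit.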
In [16]:
#Learning hours outside of school on log scale
plt.figure(figsize=[14, 6])
bins = 10 ** np.arange(0, np.log10(df['OUTHOURS'].max())+0.25, 0.25)
plt.hist(data = df, x = 'OUTHOURS', bins=bins)
plt.xscale('log')
plt.xlabel('Learning hours outside of school on log scale')
plt.title('Out-of-school learning hours distribution with log scale')
plt.show()

The out-of-school learning hours have a long-tailed distribution: many students are at the lower end of learning hours outside of school, and few learn for many hours. When plotted on a log scale, the distribution looks approximately normal. Next I will investigate the specific survey questions related to anxiety.
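
The log-scale histogram relies on bin edges that are evenly spaced in the exponent rather than in the value. A minimal sketch of that construction, using synthetic long-tailed data as an assumption in place of `OUTHOURS`:

```python
import numpy as np

# Synthetic long-tailed "hours" (assumption: not the real OUTHOURS values).
rng = np.random.default_rng(0)
hours = np.round(rng.lognormal(mean=2.0, sigma=0.8, size=1000))

# Zero has no logarithm, so only positive values can appear on a log axis.
positive = hours[hours > 0]

# Edges are 10**0, 10**0.25, 10**0.5, ... up to at least the maximum value.
step = 0.25
edges = 10 ** np.arange(0, np.log10(positive.max()) + step, step)

counts, _ = np.histogram(positive, bins=edges)
print(edges.round(2))
print(counts)
```

Students who report zero hours drop out of such a plot, which is worth keeping in mind when reading the log-scale distribution.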

In [25]:
fig, ax = plt.subplots(ncols=2, nrows=3, figsize = [12,12])

default_color = sb.color_palette()[0]
sb.countplot(data = df, x = 'ST42Q01', color = default_color, ax=ax[0,0])
ax[0,0].set_xlabel('Math Anxiety - Worry That It Will Be Difficult')
sb.countplot(data = df, x = 'ST42Q03', color = default_color, ax=ax[0,1])
ax[0,1].set_xlabel('Math Anxiety - Get Very Tense')
sb.countplot(data = df, x = 'ST42Q05', color = default_color, ax=ax[1,0])
ax[1,0].set_xlabel('Math Anxiety - Get Very Nervous')
sb.countplot(data = df, x = 'ST42Q08', color = default_color, ax=ax[1,1])
ax[1,1].set_xlabel('Math Anxiety - Feel Helpless')
sb.countplot(data = df, x = 'ST42Q10', color = default_color, ax=ax[2,0])
ax[2,0].set_xlabel('Math Anxiety - Worry About Getting Poor Grades')

fig.suptitle('Math anxiety distribution per question')

plt.show()

Most students agree or strongly agree with the statements Worry that it will be difficult and Worry about getting poor grades. Most students disagree with the statements about feeling very nervous, feeling helpless, and getting very tense.
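
Because the answers are ordinal, the bars are easier to compare when they follow the agreement scale rather than the order in which the labels happen to occur. A minimal sketch, assuming the four answer labels listed in the dataset description and using made-up answers:

```python
import pandas as pd

# Made-up answers standing in for a column such as df['ST42Q01'].
answers = pd.Series(['Agree', 'Strongly disagree', 'Disagree', 'Agree',
                     'Strongly agree', 'Disagree', 'Agree'])

# Declare the scale order explicitly (assumed labels).
likert_order = ['Strongly agree', 'Agree', 'Disagree', 'Strongly disagree']
likert_type = pd.CategoricalDtype(categories=likert_order, ordered=True)
ordered_answers = answers.astype(likert_type)

# Counts reported in scale order instead of by frequency.
counts = ordered_answers.value_counts().reindex(likert_order)
print(counts)
```

The same list can be handed to the `order` parameter of `sb.countplot` so that every panel uses an identical category order.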

In [50]:
#Gender
plt.figure(figsize=[14, 8])
default_color = sb.color_palette()[0]
sb.countplot(data = df, x = 'ST04Q01', color = default_color)
plt.title('Gender distribution')

plt.show()

Slightly more female than male students participated in the study.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

The first variables I looked into in my exploration (total score, math score and anxiety score) are approximately normally distributed. The out-of-school learning hours have a long-tailed distribution to the right. On a log scale this distribution looks approximately normal as well.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

I calculated the math score and the total score per student by taking the mean of the relevant values for each student, and added the results to the data frame as the additional columns math_score and total_score.
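
The row-wise mean described here can be sketched on a toy table. The column names mimic the plausible-value columns of the dataset; the anchored patterns `^PV\d+MATH$` and `^PV` are assumptions chosen for this illustration and are stricter than the unanchored, case-insensitive `PV` pattern used in the notebook:

```python
import pandas as pd

# Toy stand-in for the PISA table: two plausible values per domain
# plus one unrelated column that must not enter the means.
toy = pd.DataFrame({
    'PV1MATH': [400.0, 500.0],
    'PV2MATH': [420.0, 520.0],
    'PV1READ': [380.0, 540.0],
    'PV2READ': [400.0, 560.0],
    'ST04Q01': ['Female', 'Male'],
})

# Select columns by name, then average across columns (axis=1) per student.
math_cols = toy.filter(regex=r'^PV\d+MATH$')
all_pv_cols = toy.filter(regex=r'^PV')

toy['math_score'] = math_cols.mean(axis=1)
toy['total_score'] = all_pv_cols.mean(axis=1)
print(toy[['math_score', 'total_score']])
```

Anchoring the pattern with `^` guards against accidentally picking up columns that merely contain the letters somewhere in their name.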

Bivariate Exploration¶

I start the bivariate exploration by looking at the relation between the math score and the anxiety score.

In [23]:
categoric_vars = ['ST42Q01', 'ST42Q03', 'ST42Q05', 'ST42Q08', 'ST42Q10']
In [49]:
#Math score vs. anxiety score
plt.figure(figsize = [14, 8])
plt.scatter(data = df, x = 'ANXMAT', y = 'math_score', s = 1)
plt.xlabel('Anxiety score')
plt.ylabel('Math score')
plt.title('Math score vs. Anxiety score')
plt.show()

The visualization shows only a slight correlation between math anxiety and the math score. The more anxious students feel about math, the slightly lower they tend to score in math.

In [48]:
#Math score vs. gender

plt.figure(figsize = [14, 8])
sb.boxplot(data = df, x = 'ST04Q01', y = 'math_score',color = default_color)
plt.ylabel('Math score')
plt.xlabel('Gender')
plt.title('Math score per gender')

plt.show()

Male students score slightly better in math than female students.

In [47]:
#Math anxiety vs. gender
plt.figure(figsize = [14, 8])
sb.violinplot(data = df, x = 'ST04Q01', y = 'ANXMAT',color = default_color)
plt.ylabel('Math anxiety score')
plt.xlabel('Gender')
plt.title('Math anxiety per gender')

plt.show()

Female students feel more anxious about math than male students.
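
The differences read off the box plot and the violin plot can be backed by per-group summary statistics. A minimal sketch on a toy frame that reuses the column names `ST04Q01`, `math_score` and `ANXMAT` (the values are invented for illustration):

```python
import pandas as pd

# Invented values; only the column names follow the notebook.
toy = pd.DataFrame({
    'ST04Q01': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'math_score': [480.0, 500.0, 460.0, 520.0, 500.0, 480.0],
    'ANXMAT': [0.4, 0.1, 0.6, -0.2, 0.2, 0.1],
})

# Mean and median of both variables for each gender.
summary = toy.groupby('ST04Q01')[['math_score', 'ANXMAT']].agg(['mean', 'median'])
print(summary)
```

Comparing the median with the mean also gives a quick hint on whether a few extreme values drive the difference between the groups.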

In [46]:
#Math score vs. learning hours outside of school 
plt.figure(figsize = [14, 8])
sb.stripplot(data = df, x = 'OUTHOURS', y = 'math_score', jitter=0.3, size = 1)
plt.xlim([0, 80])
plt.xlabel('Learning hours outside of school')
plt.xticks([0, 20, 40, 60, 80])
plt.ylabel('Math score')
plt.title('Math score vs. Out-of-school learning hours')
plt.show()
In [45]:
#Math score vs. learning hours outside of school with log scale
plt.figure(figsize = [14, 8])
sb.stripplot(data = df, x = 'OUTHOURS', y = 'math_score', jitter=0.3, size = 1)
plt.xlabel('Learning hours outside of school')
plt.xscale('log')
plt.ylabel('Math score')
plt.title('Math score vs. Out-of-school learning hours with log scale')
plt.show()

The visualizations above do not show a strong correlation between more learning hours outside of school and a better math score. Next I will look at the total score vs. the out-of-school learning hours.
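
One caveat with these plots: as far as I know, `sb.stripplot` treats the x variable as categorical by default, so axis limits and a log scale then refer to category positions rather than to hours. An alternative is to keep the axis numeric and add the jitter by hand. The helper `jitter` below is a hypothetical name introduced for this sketch, and the hours are made up:

```python
import numpy as np

def jitter(values, width, rng):
    """Return the values shifted by uniform noise in [-width, +width]."""
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-width, width, size=values.shape)

rng = np.random.default_rng(42)
hours = np.array([0, 1, 1, 2, 5, 5, 5, 40, 70], dtype=float)  # made-up hours
x_jittered = jitter(hours, width=0.3, rng=rng)

# x_jittered can be used as the x values of plt.scatter(..., s=1); the axis
# stays numeric, so plt.xlim([0, 80]) then refers to actual hours.
print(np.round(x_jittered, 2))
```

For a log-scaled axis the zero-hour rows would have to be filtered out first, since jittered values around zero can be negative.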

In [44]:
#Total score vs. learning hours outside of school
plt.figure(figsize = [14, 8])
sb.stripplot(data = df, x = 'OUTHOURS', y = 'total_score', size = 1, jitter=0.3)
plt.xlim([0, 80])
plt.xlabel('Learning hours outside of school')
#plt.xscale('log')
plt.xticks([0, 20, 40, 60, 80])
plt.ylabel('Total score')
plt.title('Total score vs. Out-of-school learning hours')
plt.show()
In [43]:
#Total score vs. learning hours outside of school with log scale
plt.figure(figsize = [14, 8])
sb.stripplot(data = df, x = 'OUTHOURS', y = 'total_score', size = 1, jitter=0.3)
plt.xlabel('Learning hours outside of school')
plt.xscale('log')
plt.ylabel('Total score')
plt.title('Total score vs. Out-of-school learning hours with log scale')
plt.show()

The same applies as for the math score: students do not score better the more time they spend learning outside of school. Next I will take a closer look at the relation between the math score and the total score, since the visualizations above look almost the same for both.

In [42]:
#Visualize Math score vs. Total score
plt.figure(figsize = [14, 8])
plt.scatter(data = df, x = 'math_score', y = 'total_score', s = 1)
plt.xlabel('Math score')
plt.ylabel('Total score')
plt.title('Math score vs. Total score')
plt.show()

The relationship between the math score and the total score is linear. Students score in the same range in math as they score in total. The last visualization in the bivariate section looks at the math anxiety survey questions in relation to the math score.

In [52]:
#Visualize math anxiety questions vs. math score
fig, ax = plt.subplots(nrows = 5, figsize = [14,8])

# One violin plot per survey question, each on its own row
for i, var in enumerate(categoric_vars):
    sb.violinplot(data = df, x = var, y = 'math_score', ax = ax[i],
               color = default_color)
plt.tight_layout()  # adjust the spacing once, after all panels are drawn

fig.suptitle('Math score per math anxiety question'.title(), y=1.02)
plt.show();

The visualizations show a clear relation between anxiety and the math score. The more strongly students disagree with the following statements, the better they score in math:

  • Worry That It Will Be Difficult
  • Get Very Tense
  • Get Very Nervous
  • Feel Helpless
  • Worry About Getting Poor Grades

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

The more anxious students feel, the lower they score in math. There is also a relation between anxiety and gender: female students feel more anxious about math than male students.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

The math score and the total score are linearly related: students score at the same level in math as in total. Students who learn more hours outside of school do not score better in math or in total, so there is no strong correlation between out-of-school learning hours and either score.

Multivariate Exploration¶

I am starting the multivariate exploration with a plot matrix of math score, total score, anxiety score and out-of-school learning hours.

In [57]:
#Plot matrix of the numeric variables Math score, Total score, Anxiety score and Learning hours outside of school
numeric_var = ['math_score', 'total_score', 'ANXMAT', 'OUTHOURS']

g = sb.PairGrid(data = df, vars = numeric_var)
g.fig.set_size_inches(14, 8);
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)

plt.suptitle('Matrix of Math score, Total score, Math Anxiety score, Out-of-school learning hours'.title(), y=1.02);

The plot matrix shows no findings beyond those of the bivariate exploration. Next I will look at the relation between the total score and the math score for two of the survey questions related to math anxiety.

In [62]:
def hist2dgrid(x, y, **kwargs):
    """ Quick hack for creating heat maps with seaborn's FacetGrid. """
    cmap = kwargs.pop('color')  # the colormap name is passed in through `color`
    # Start the bin edges at the minimum rounded down to a multiple of 100
    bins_x = np.arange(df['math_score'].min()//100*100, df['math_score'].max()+100, 100)
    bins_y = np.arange(df['total_score'].min()//100*100, df['total_score'].max()+100, 100)
    plt.hist2d(x, y, cmap = cmap, bins=[bins_x, bins_y], cmin = 0.5)
In [63]:
#Visualization for question `Worry That It Will Be Difficult`
g = sb.FacetGrid(data = df, col = 'ST42Q01', col_wrap = 3, height = 3)
g.fig.set_size_inches(14, 8);
g.map(hist2dgrid, 'math_score', 'total_score', color = 'viridis_r')
g.set_xlabels('Math score')
g.set_ylabels('Total score')

plt.suptitle('Distribution of math and total score by math anxiety - Worry That It Will Be Difficult'.title(), y = 1.02)


plt.show()
In [64]:
#Visualization for question `Get very tense`
g = sb.FacetGrid(data = df, col = 'ST42Q03', col_wrap = 3, height = 3)
g.fig.set_size_inches(14, 8);
g.map(hist2dgrid, 'math_score', 'total_score', color = 'viridis_r')
g.set_xlabels('Math score')
g.set_ylabels('Total score')

plt.suptitle('Distribution of math and total score by math anxiety - Get very tense'.title(), y = 1.02)

plt.show()

The more strongly students agree with the statements related to math anxiety, the more likely they are to score lower in math and in total. Next up is the relation between math score, total score and anxiety score.

In [61]:
#Anxiety in relation to math and total score
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = [14, 8]);

plt.scatter(data = df, x = 'math_score', y = 'total_score', c = 'ANXMAT', s= 1, cmap = 'viridis_r')
plt.colorbar(label ='Math anxiety score')
plt.xlabel('Math score')
plt.ylabel('Total score')
plt.title('Math score vs. Total score vs. Math anxiety score');

There is a trend towards higher scores the less anxious students feel, but it is not very clear from this type of visualization. The next exploration covers the relation between math score, out-of-school learning hours and the math anxiety score.

In [65]:
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = [14, 8]);
plt.scatter(data = df, x = 'OUTHOURS', y = 'math_score', c = 'ANXMAT', s= 1,
               cmap = 'viridis_r')
plt.colorbar(label ='Math anxiety score')
plt.xlabel('Out-of-school learning hours')
plt.ylabel('Math score')
plt.title('Math score vs. Out-of-school learning hours vs. Math anxiety score');

This visualization shows a tendency for students who feel less anxious about math to score higher in math. But even students who spent many hours learning outside of school did not necessarily score higher in math. That is why I look into the correlation coefficients in the next visualization.

In [67]:
#Correlation heatmap for numeric variables Math score, Total score, Anxiety score and Learning hours outside of school
plt.figure(figsize = [14, 8])  # new figure; `fig` from the previous cell would not size this plot
# vmin/vmax span the full range so that negative correlations are not clipped in the colors
sb.heatmap(df[numeric_var].corr(), cmap = 'viridis_r', annot = True,
          fmt = '.2f', vmin = -1, vmax = 1)
plt.title('Correlation heatmap for Math score, Total score, Math anxiety score and Out-of-school learning hours'.title(), y = 1.02);

There is a strong linear correlation between the total score and the math score. The learning hours outside of school show no correlation with the math score, the total score or the math anxiety score. Math anxiety has a negative correlation of -0.36 with the total score and of -0.38 with the math score. As a result I will not look further into the variable learning hours outside of school. Next I will explore the influence of gender on the math score and the anxiety score.
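
For reference, the coefficients in the heatmap are Pearson correlations, which `DataFrame.corr()` computes by default. The sketch below recomputes one such coefficient by hand on synthetic data; the slope and noise level are arbitrary assumptions chosen only to produce a moderate negative relation:

```python
import numpy as np
import pandas as pd

# Synthetic data: score falls with anxiety, plus a lot of noise (assumed values).
rng = np.random.default_rng(1)
anxiety = rng.normal(size=500)
score = 500 - 35 * anxiety + rng.normal(scale=85, size=500)
toy = pd.DataFrame({'ANXMAT': anxiety, 'math_score': score})

# pandas: Pearson correlation between the two columns.
r_pandas = toy['ANXMAT'].corr(toy['math_score'])

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r_manual = pearson_r(toy['ANXMAT'], toy['math_score'])
print(round(r_pandas, 3), round(r_manual, 3))
```

A coefficient of this size means the relation is visible on average while individual students still vary widely, which matches the diffuse scatter plots.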

In [71]:
#Math anxiety score vs. math score per gender
g = sb.FacetGrid(data = df, col = 'ST04Q01')
g.fig.set_size_inches(14, 8);
g.map(plt.scatter, 'ANXMAT','math_score',  s=1)
g.set_xlabels('Anxiety score')
g.set_ylabels('Math score')

plt.suptitle('Math score vs. Anxiety score per gender'.title(), y = 1.02);

plt.show()

Here, too, female and male students show the same pattern: the more anxious they feel about math, the lower they score in math. However, this visualization is not well suited to showing the difference between the genders.

In [252]:
# Math score in relation to the question Math Anxiety - Worry That It Will Be Difficult and gender
fig = plt.figure(figsize = [14,6])
ax = sb.stripplot(data = df, x = 'ST42Q01', y = 'math_score', hue = 'ST04Q01',
           palette = 'Blues', size = 1, jitter = 0.3, dodge = True)
plt.title('Math score in relation to the question Math Anxiety - Worry That It Will Be Difficult and gender')
plt.legend(title ='Gender')
plt.ylabel('Math score')
plt.xlabel('Math Anxiety - Worry That It Will Be Difficult')

plt.show();

This visualization shows the answers to the question Worry That It Will Be Difficult in relation to the math score. The more strongly students disagree with the statement, the higher they score in math. There is also a difference between the genders: male students score higher than female students for each answer. The only exception is Strongly disagree, where female students have a higher math score than male students.

In [253]:
# Math score in relation to the question Math Anxiety - Get very tense and gender
fig = plt.figure(figsize = [14,6])
ax = sb.stripplot(data = df, x = 'ST42Q03', y = 'math_score', hue = 'ST04Q01',
           palette = 'Blues', size = 1, jitter = 0.3, dodge = True)
plt.title('Math score in relation to the question Math Anxiety - Get very tense and gender')
plt.legend(title ='Gender')
plt.ylabel('Math score')
plt.xlabel('Math Anxiety - Get very tense')

plt.show();

This visualization shows the answers to the question Get very tense in relation to the math score. The more strongly students disagree with the statement, the higher they score in math. There is also a difference between the genders: male students score higher than female students for each answer.
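
With very small, densely overplotted points, differences between the genders within each answer are hard to judge by eye. A table of medians per answer and gender makes the comparison explicit. A minimal sketch on invented values, reusing the column names `ST42Q03`, `ST04Q01` and `math_score`:

```python
import pandas as pd

# Invented values; only the column names follow the notebook.
toy = pd.DataFrame({
    'ST42Q03': ['Agree', 'Agree', 'Disagree', 'Disagree', 'Agree', 'Disagree'],
    'ST04Q01': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'math_score': [440.0, 460.0, 500.0, 520.0, 460.0, 540.0],
})

# Rows: answer, columns: gender, cells: median math score.
medians = toy.pivot_table(index='ST42Q03', columns='ST04Q01',
                          values='math_score', aggfunc='median')

# Positive values mean that male students have the higher median.
gap = medians['Male'] - medians['Female']
print(medians)
print(gap)
```

On the real data, a change of sign in `gap` for one answer would correspond to an exception like the one noted for Strongly disagree in the previous question.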

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

Female students score lower in math and they also feel more anxious about math. Male students score higher in math and they feel less anxious about math. Out-of-school learning hours do not have a strong influence on the math or total score.

Were there any interesting or surprising interactions between features?¶

It was surprising to me that the math score and the total score are so strongly correlated, and that math anxiety has an influence on the total score as well. I expected that some students would score very well in math but lower in the other categories, leading to a lower total score.

Conclusions¶

There is a clear relation between feeling anxious about math and scoring low in math or in total. The more strongly students disagree with the survey statements about math anxiety, the better they score in math. Female students feel more anxious about math than male students. Surprisingly, out-of-school learning hours do not have a strong influence on the math or total score. Some students with many out-of-school learning hours score at a high level, but others learn for many hours outside of school and still do not score at a high level.
